Abstract: Group model selection is the problem of determining
a small subset of groups of predictors (e.g., the expression
data of genes) that are responsible for the majority of the variation
in a response variable (e.g., the malignancy of a tumor). This 
paper focuses on group model selection in high-dimensional 
linear models, in which the number of predictors far exceeds 
the number of samples of the response variable. Existing works 
on high-dimensional group model selection either require the 
number of samples of the response variable to be significantly 
larger than the total number of predictors contributing to the 
response or impose restrictive statistical priors on the predictors 
and/or nonzero regression coefficients. This paper provides a
comprehensive understanding of a low-complexity approach to
group model selection that avoids some of these limitations. 
The proposed approach, termed Group Thresholding (GroTh), 
is based on thresholding of marginal correlations of groups 
of predictors with the response variable and is reminiscent 
of existing thresholding-based approaches in the literature. 
The most important contribution of the paper in this regard 
is relating the performance of GroTh to a polynomial-time 
verifiable property of the predictors for the general case of 
arbitrary (random or deterministic) predictors and arbitrary 
nonzero regression coefficients. 

I. Introduction 

A. Motivation and Background 

One of the most fundamental problems in statistical
data analysis is to learn the relationship between the samples 
of a dependent or response variable (e.g., the malignancy 
of a tumor, the health of a network) and the samples of 
independent or predictor variables (e.g., the expression data 
of genes, the traffic data in the network). This problem was 
relatively easy in the data-starved world of yesteryear. We
had n samples and p predictors, and our inability to observe 
too many variables meant that we lived in the "n greater than 
or equal to p" world. Times have changed now. The data-rich 
world of today has enabled us to simultaneously observe an 
unprecedented number of variables per sample. It is nearly 
impossible in many of these instances to collect as many, 
or more, samples as the number of predictors. Imagine, for 
example, collecting hundreds of thousands of thyroid tumors 
in a clinical setting. The "n smaller than p" world is no 
longer a theoretical construct in statistical data analysis. It 
has finally arrived; and it is here to stay. 

This paper concerns statistical inference in the "n smaller 
than p" setting for the case when the response variable 
depends linearly on the predictors. Mathematically, a model
of this form can be expressed as

$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j^0 + \varepsilon_i, \quad i = 1, \dots, n. \tag{1}$$

Here, $y_i$ denotes the i-th sample of the response variable,
$x_{ij}$ denotes the i-th sample of the j-th predictor, $\varepsilon_i$ denotes
the error in the model, and the parameters $\{\beta_j^0\}$ are called
regression coefficients. This relationship between the samples
of the response variable and those of the predictors can be
expressed compactly in matrix-vector form as $y = X\beta^0 + \varepsilon$.
The matrix X in this form, termed the design matrix, is an
$n \times p$ matrix whose j-th column comprises the n samples
of the j-th predictor. In tumor classification, for example, 
an entry in the response variable y could correspond to the 
malignancy (expressed as a number) of a tumor
sample, while the corresponding row in X would correspond 
to the expression level of p genes in that tumor sample. 

The linear model $y = X\beta^0 + \varepsilon$, despite its mathematical
simplicity, continues to make profound impacts in countless
application areas [1]. Such models are used for various
inferential purposes. In this paper, we focus on the problem
of model selection in high-dimensional linear models, which
involves determining a small subset of p predictors that
are responsible for the majority (or all) of the variation in
the response variable y. High-dimensional model selection 
can be used to implicate a small number of genes in the 
development of cancerous tumors, identify a small number 
of genes that primarily affect prognosis of a disease, etc. 

B. Group Model Selection and Our Contributions 

There exist many applications in statistical model selection 
where the implication of a single predictor in the response 
variable implies presence of other related predictors in the 
true model. This happens, for instance, in the case of 
microarray data when the genes (predictors) share the same 
biological pathway [2]. In such situations, it is better to 
reformulate the problem of model selection in a "group" 
setting. Specifically, the response variable $y = X\beta^0 + \varepsilon$ in
high-dimensional linear models in group settings can be best
explained by a small number of groups of predictors:

$$y = \sum_{i=1}^{m} X_i\beta_i^0 + \varepsilon = \sum_{i \in \mathcal{K}} X_i\beta_i^0 + \varepsilon, \tag{2}$$



Algorithm 1 The Group Thresholding (GroTh) Algorithm for Group Model Selection

Input: An $n \times p$ design matrix X, response variable y, number of predictors per group r, and (group) model order k
Output: An estimate $\widehat{\mathcal{K}} \subset \{1, \dots, m\}$ of the true (group) model $\mathcal{K}$

$f \leftarrow [X_1 \; X_2 \; \cdots \; X_m]^T y$ {Compute marginal correlations}

$(\mathcal{I}, \{\|f_{(j)}\|_2\}) \leftarrow \mathrm{SORT}(\{1, \dots, m\}, \{\|f_i\|_2 := \|X_i^T y\|_2\})$ {Sort groups of marginal correlations}

$\widehat{\mathcal{K}} \leftarrow \mathcal{I}[1 : k]$ {Select model via group thresholding}



where $X_i$, an $n \times p_i$ submatrix of X, denotes the i-th
group of predictors, $\beta_i^0$ denotes the group of $p_i$ regression
coefficients associated with the group of predictors $X_i$, and
the set $\mathcal{K} := \{1 \le i \le m : \beta_i^0 \neq 0\}$ denotes the underlying
true (group) model, corresponding to the $k := |\mathcal{K}| \ll m$
groups of predictors that explain y.
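As a rough illustration of Algorithm 1, the following NumPy sketch assumes equal-sized groups stored as contiguous column blocks of X, so that group i (zero-based) occupies columns ir through (i+1)r - 1. The function name `groth` and this storage layout are illustrative assumptions, not something specified in the paper.

```python
import numpy as np

def groth(X, y, r, k):
    """Sketch of Group Thresholding: keep the k groups with the largest ||X_i^T y||_2.

    Assumes (hypothetically) that group i is the column block X[:, i*r:(i+1)*r].
    Returns the selected zero-based group indices in increasing order.
    """
    n, p = X.shape
    if p % r != 0:
        raise ValueError("the number of columns p must be a multiple of r")
    m = p // r
    f = X.T @ y                                          # marginal correlations
    group_norms = np.linalg.norm(f.reshape(m, r), axis=1)  # ||f_i||_2 per group
    order = np.argsort(-group_norms, kind="stable")      # descending sort
    return np.sort(order[:k])                            # first k sorted groups

# Tiny usage example: three orthonormal groups of size two, group 1 is active.
X = np.eye(6)
y = X[:, 2:4] @ np.array([1.0, 2.0])
print(groth(X, y, r=2, k=1))
```

The only matrix operation is the product of X^T with y, which is where the low per-run cost of thresholding comes from.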

One of the main contributions of this paper is a comprehensive
understanding of a polynomial-time algorithm,
which we term Group Thresholding (GroTh), that returns
an estimate $\widehat{\mathcal{K}}$ of the true (group) model $\mathcal{K}$ for the general
case of arbitrary (random or deterministic) design matrices
and arbitrary nonzero regression coefficients. To this end,
we make use of two computable geometric measures of
group coherence of a design matrix, namely, the worst-case group
coherence $\mu_X^g$ and the average group coherence $\nu_X^g$, to
provide a nonasymptotic analysis of GroTh (Algorithm 1).
We in particular establish that if X satisfies a verifiable
group coherence property then, for all but a vanishingly small
fraction of possible models $\mathcal{K}$, GroTh: (i) handles linear
scaling of the total number of predictors contributing to the
response, $\sum_{i \in \mathcal{K}} p_i = O(n)$,$^1$ and (ii) returns indices of the
groups of predictors whose contributions to the response,
$\{\|\beta_i^0\|_2\}_{i \in \mathcal{K}}$, are above a certain self-noise floor that is a
function of both $\mu_X^g$ and $\|\beta^0\|_2$.

C. Relationship to Previous Work 

The basic idea of using grouped predictors for inference 
in linear models has been explored by various researchers 
in recent years. Some notable works in this direction in 
the "$n \ll p$" setting include [3]-[9]. Despite these inspiring
results, more work needs to be done for high-dimensional 
group model selection. This is because the results reported 
in [3]-[9] do not guarantee linear scaling of the total number 
of predictors contributing to the response for the case of 
arbitrary design matrices and nonzero regression coefficients. 

The work in this paper is also related to another body 
of work in statistics and signal processing literature that 
studies the high-dimensional linear model $y = X\beta^0 + \varepsilon$
for the restrictive case of X having a Kronecker structure:
$X := A^T \otimes I$ for some matrix A, where $\otimes$ denotes the
Kronecker product. An incomplete list of works in this direction
includes [10]-[16]. These restrictive works, however,
also fail to guarantee linear scaling of the total number 
of predictors contributing to the response for the case of 
arbitrary nonzero regression coefficients. 

$^1$Recall that $f(n) = O(g(n))$ if there exist positive C and $n_0$ such
that for all $n > n_0$, $f(n) \le C g(n)$. Also, $f(n) = \Omega(g(n))$ if $g(n) =
O(f(n))$, and $f(n) = \Theta(g(n))$ if $f(n) = O(g(n))$ and $g(n) = O(f(n))$.



Finally, note that the group model selection procedure 
studied in this paper is based on analyzing the marginal 
correlations, $X^T y$, of predictors with the response variable.
Therefore, our work is algorithmically similar to the group 
thresholding approaches of [8], [13], [14]. The main appeal 
of such approaches is their low computational complexity of 
$O(np)$, which is much smaller than the typical computational
complexity associated with other model selection procedures
[17]. In addition to the scaling limitations of the total number
of influential predictors discussed earlier, however, the works
in [8], [13], [14] also incorrectly conclude that performance
of thresholding-based approaches is inversely proportional to
the dynamic range, $\max_{i \in \mathcal{K}} \|\beta_i^0\|_2 / \min_{i \in \mathcal{K}} \|\beta_i^0\|_2$, of the nonzero groups of
regression coefficients.

D. Mathematical Convention 

The predictors and the response variable are assumed to be 
real valued throughout the paper, with the understanding that 
extensions to a complex- valued setting can be carried out in 
a straightforward manner. Uppercase letters are reserved for 
matrices, while lowercase letters are used for both vectors 
and scalars. Constants that do not depend upon the problem 
parameters (such as n, m, p, and k) are denoted by $c_0$,
$c_1$, etc. The notation $[q]$ for $q \in \mathbb{N}$ is a shorthand for
the set $\{1, \dots, q\}$, while the notation $\overset{d}{=}$ signifies equality
in distribution. The transpose operation is denoted by $(\cdot)^T$
and the spectral norm of a matrix is denoted by $\|\cdot\|_2$.
Finally, the $\ell_{p,q}$ norm of a vector $v^T = [v_1^T \dots v_m^T]$ with
each $v_i \in \mathbb{R}^r$ is defined as $\|v\|_{p,q} := \big(\sum_{i=1}^{m} \|v_i\|_p^q\big)^{1/q}$ for
$p, q \in (0, \infty]$, where $\|\cdot\|_p$ denotes the usual $\ell_p$ norm. Note
that $\|v\|_{p,\infty} = \max_i \|v_i\|_p$ and $\|v\|_{p,q} = \|v\|_q$ for r = 1.
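The mixed norm defined above can be evaluated in a few lines. The sketch below is only an illustration: the helper name `mixed_norm` and the convention that the blocks $v_i$ are consecutive length-r slices of v are assumptions made here.

```python
import numpy as np

def mixed_norm(v, r, p=2.0, q=np.inf):
    """Sketch of the l_{p,q} norm: the l_q norm of the per-block l_p norms.

    Assumes (hypothetically) that block i is the slice v[i*r:(i+1)*r].
    """
    v = np.asarray(v, dtype=float)
    if v.size % r != 0:
        raise ValueError("len(v) must be a multiple of the block size r")
    block_norms = np.linalg.norm(v.reshape(-1, r), ord=p, axis=1)  # ||v_i||_p
    return float(np.linalg.norm(block_norms, ord=q))               # l_q over blocks

v = [3.0, 4.0, 0.0, 1.0]
print(mixed_norm(v, r=2))          # l_{2,inf}: largest block l_2 norm
print(mixed_norm(v, r=2, q=1.0))   # l_{2,1}: sum of block l_2 norms
```

With r = 1 every block is a scalar, so the result reduces to the ordinary $\ell_q$ norm of v, in line with the note above.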

E. Organization 

In Section II, we mathematically formulate the problem
of group model selection, rigorously define the notions of
worst-case group coherence, average group coherence and
the group coherence property, and state and discuss the main
result of the paper. In Section III, we prove the main result
of the paper. Finally, we present some numerical results in
Section IV and conclude in Section V.

II. Group Model Selection Using GroTh 

A. Problem Formulation 

The object of attention in this paper is the high-dimensional
linear model $y = X\beta^0 + \varepsilon$ relating the response
variable $y \in \mathbb{R}^n$ to the p ($\gg n$) predictors comprising
the columns of the design matrix X. Since scalings of the
columns of X can be absorbed into the regression vector $\beta^0$,
we assume without loss of generality that the columns of X
have unit $\ell_2$ norms. There are three simplifying assumptions
we make in this paper that will be relaxed in a sequel to
this work. First, the modeling error is zero, $\varepsilon = 0$, and
thus the response variable is exactly equal to a parsimonious
linear combination of grouped predictors: $y = \sum_{i \in \mathcal{K}} X_i \beta_i^0$.
Second, the groups of predictors $\{X_i\}_{i=1}^{m}$ are characterized
by the same number of predictors per group: $X_i \in \mathbb{R}^{n \times r}$
with $r := p/m < n$. Third, the groups of predictors $\{X_i\}_{i=1}^{m}$
are orthonormalized: $X_i^T X_i = I$.

The main goal of this paper is characterization of the
performance of a group model selection procedure, termed
GroTh, that returns an estimate $\widehat{\mathcal{K}}$ of the true model $\mathcal{K}$
by sorting the groups of marginal correlations $f_i := X_i^T y$
according to their $\ell_2$ norms, $\|f_i\|_2$, in descending order and
setting $\widehat{\mathcal{K}}$ to be indices of the first k sorted groups of
marginal correlations (see Algorithm 1). Instead of focusing
on the worst-case performance of GroTh, however, we seek
to characterize its performance for an arbitrary (but fixed)
set of nonzero (grouped) regression coefficients supported on
most models. Specifically, we do not impose any statistical
prior on the set of nonzero regression coefficients, while we
assume that the true (group) model $\mathcal{K} := \{i \in [m] : \beta_i^0 \neq 0\}$
is a uniformly random k-subset of [m]. Finally, the metrics
of goodness we use in this paper are the false-discovery 
proportion (FDP) and the non-discovery proportion (NDP), 
defined as 



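$$\mathrm{FDP}(\widehat{\mathcal{K}}) := \frac{|\widehat{\mathcal{K}} \setminus \mathcal{K}|}{|\widehat{\mathcal{K}}|} \quad \text{and} \quad \mathrm{NDP}(\widehat{\mathcal{K}}) := \frac{|\mathcal{K} \setminus \widehat{\mathcal{K}}|}{|\mathcal{K}|}, \tag{3}$$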



respectively. These two metrics have gained widespread 
usage in multiple hypotheses testing problems in recent 
years. In particular, the expectation of the FDP is the well- 
known false-discovery rate (FDR) [18], [19]. 
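For concreteness, both metrics can be computed directly from the two index sets. The sketch below assumes the usual set-difference definitions (false discoveries over the size of the estimate, missed groups over the size of the true model); the function name `fdp_ndp` and the convention of returning 0 for an empty denominator are choices made here for illustration.

```python
def fdp_ndp(estimate, truth):
    """Sketch: false-discovery and non-discovery proportions of an estimated model.

    `estimate` and `truth` are iterables of group indices. Empty sets are mapped
    to a proportion of 0.0, which is a convention assumed here.
    """
    est, true = set(estimate), set(truth)
    fdp = len(est - true) / len(est) if est else 0.0    # |est \ true| / |est|
    ndp = len(true - est) / len(true) if true else 0.0  # |true \ est| / |true|
    return fdp, ndp

# One wrong group and one missed group out of three each.
print(fdp_ndp([0, 1, 2], [1, 2, 3]))
```

When the estimate and the true model have the same size k, as in GroTh with known model order, the two proportions coincide.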

B. Main Result and Discussion 

Heuristically, successful group model selection requires 
the groups of predictors contributing to the response variable 
to be sufficiently distinguishable from the ones outside the 
true model. In this paper, we capture the notion of distin- 
guishability of predictors through two easily computable, 
global geometric measures of the design matrix, namely, the 
worst-case group coherence and the average group coherence.
The worst-case group coherence of X is defined as

$$\mu_X^g := \max_{i, j \in [m] :\, i \neq j} \|X_i^T X_j\|_2, \tag{4}$$

while the average group coherence of X is defined as

$$\nu_X^g := \frac{1}{m - 1} \max_{i \in [m]} \Big\| \sum_{j \in [m] :\, j \neq i} X_i^T X_j \Big\|_2. \tag{5}$$

Note that $\mu_X^g$ is a trivial upper bound on $\nu_X^g$. It is also worth
pointing out that the worst-case group coherence and its variants
have existed in earlier literature [7], [20], but the average
group coherence is defined here for the first time.
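Both coherence measures can be computed by direct enumeration of the groups. The following NumPy sketch assumes equal-sized groups stored as contiguous column blocks of X; the function name `group_coherences` is hypothetical, and no attempt is made to optimize the quadratic number of spectral-norm evaluations.

```python
import numpy as np

def group_coherences(X, r):
    """Sketch: worst-case and average group coherence of X with groups of r columns.

    Assumes (hypothetically) that group i is the column block X[:, i*r:(i+1)*r].
    Returns (mu, nu).
    """
    n, p = X.shape
    if p % r != 0:
        raise ValueError("the number of columns p must be a multiple of r")
    m = p // r
    if m < 2:
        raise ValueError("at least two groups are needed")
    groups = [X[:, i * r:(i + 1) * r] for i in range(m)]
    total = np.sum(groups, axis=0)          # sum of all groups, an n x r matrix
    mu, nu = 0.0, 0.0
    for i in range(m):
        Xi = groups[i]
        for j in range(i + 1, m):           # spectral norm is symmetric in (i, j)
            mu = max(mu, np.linalg.norm(Xi.T @ groups[j], 2))
        # sum over j != i of X_i^T X_j equals X_i^T (total - X_i)
        nu = max(nu, np.linalg.norm(Xi.T @ (total - Xi), 2) / (m - 1))
    return float(mu), float(nu)

# Three one-column groups: two identical columns and one orthogonal to them.
X = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(group_coherences(X, r=1))
```

In this small example the averaging over the other groups makes the average coherence strictly smaller than the worst-case coherence.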



The central thesis of this paper is that group model selec- 
tion using GroTh can be successful if these two measures of 
group coherence of X are small enough. In particular, we
address the question of how small these two measures should
be in terms of the group coherence property.

Definition 1 (The Group Coherence Property). The $n \times rm$
design matrix X is said to satisfy the group coherence
property if the following two conditions hold for some
positive constants $c_\mu$ and $c_\nu$:

$$\mu_X^g \le \frac{c_\mu}{\sqrt{\log m}}, \tag{GroCP-1}$$

and

$$\nu_X^g \le c_\nu\, \mu_X^g \sqrt{\frac{r \log m}{n}}. \tag{GroCP-2}$$
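Given the two coherence values, checking the property amounts to comparing two ratios against the chosen constants. The sketch below reads the two conditions as $\mu_X^g \sqrt{\log m} \le c_\mu$ and $\nu_X^g \le c_\nu \mu_X^g \sqrt{r \log m / n}$; the function names are hypothetical, and the coherences are taken as inputs rather than recomputed.

```python
import math

def grocp_constants(mu, nu, n, r, m):
    """Sketch: smallest c_mu and c_nu for which (GroCP-1) and (GroCP-2) hold.

    mu and nu are the worst-case and average group coherences (assumed given);
    m >= 2 is assumed so that log(m) > 0.
    """
    c_mu = mu * math.sqrt(math.log(m))
    scale = mu * math.sqrt(r * math.log(m) / n)
    c_nu = nu / scale if scale > 0 else 0.0   # mu == 0 forces nu == 0
    return c_mu, c_nu

def satisfies_grocp(mu, nu, n, r, m, c_mu, c_nu):
    """Check both conditions for user-chosen constants c_mu and c_nu."""
    need_mu, need_nu = grocp_constants(mu, nu, n, r, m)
    return need_mu <= c_mu and need_nu <= c_nu

print(grocp_constants(0.5, 0.1, n=100, r=4, m=100))
print(satisfies_grocp(0.5, 0.1, n=100, r=4, m=100, c_mu=2.0, c_nu=1.0))
```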



It is straightforward to observe from the above definition
that the group coherence property is a global property of X
that can be explicitly verified in polynomial time. Finally, we
define $\beta_{(\ell)}^0$ to be the $\ell$-th largest group of nonzero regression
coefficients: $\|\beta_{(1)}^0\|_2 \ge \|\beta_{(2)}^0\|_2 \ge \dots \ge \|\beta_{(k)}^0\|_2 > 0$. We
are now ready to state the main result of this paper.

Theorem 1 (Group Model Selection Using GroTh). Suppose
the design matrix X satisfies the group coherence property
with parameters $c_\mu$ and $c_\nu$. Next, fix parameters $c_1 > 2$, $c_2 \in (0, 1)$,
and define parameter $c_3 := \frac{32\sqrt{2e}\,(2c_1 - 1)}{(1 - c_2)(c_1 - 1)}$. Then, under
the assumptions $c_1 r k \le n$, $c_\mu < c_3^{-1}$, and $c_\nu \le \sqrt{c_1}\, c_2 c_3$,
we have with probability exceeding $1 - e^2 m^{-1}$ that

$$\big\{ i \in \mathcal{K} : \|\beta_i^0\|_2 > c_3\, \mu_X^g \|\beta^0\|_2 \sqrt{\log m} \big\} \subset \widehat{\mathcal{K}}, \tag{6}$$

resulting in $\mathrm{FDP}(\widehat{\mathcal{K}}) \le 1 - L/k$ and $\mathrm{NDP}(\widehat{\mathcal{K}}) \le 1 - L/k$,
where L is defined to be the largest integer for which the
inequality $\|\beta_{(L)}^0\|_2 > c_3\, \mu_X^g \|\beta^0\|_2 \sqrt{\log m}$ holds. Here, the
probability is with respect to the uniform distribution of the
true model $\mathcal{K}$ over all possible models.

A proof of this theorem is given in Section III. We now
provide a brief discussion of the significance of this result.
First, Theorem 1 indicates that a polynomial-time verifiable
property, namely, the group coherence property, of the design
matrix can be checked to ascertain whether GroTh, which
has computational complexity of $O(np)$, is well suited for
group model selection. Second, it states that if X satisfies
the group coherence property then GroTh handles linear
scaling of the total number of predictors contributing to the
response, $rk = O(n)$, for all but a vanishingly small fraction
$O(m^{-1})$ of models. This is in stark contrast to the earlier
works [8], [13], [14] on thresholding-based approaches in
high-dimensional linear models, which do not guarantee such
linear scaling for the case of arbitrary nonzero regression
coefficients. Note that while we do not provide in this paper
explicit examples of design matrices satisfying the group
coherence property, numerical results in Section IV show
that the set of design matrices satisfying the group coherence
property is not empty.

Finally, Theorem 1 offers a nice interpretation of the price
one might have to pay in estimating the true model using only
marginal correlations. Specifically, (6) in the theorem implies
group thresholding of marginal correlations effectively gives
rise to a self-noise floor of $O(\mu_X^g \|\beta^0\|_2 \sqrt{\log m})$. In words,
the estimate $\widehat{\mathcal{K}}$ returned by GroTh is guaranteed to contain the
indices of all the groups of predictors whose contributions
to the response variable (in the $\ell_2$ sense) are above the
self-noise floor of $O(\mu_X^g \|\beta^0\|_2 \sqrt{\log m})$ (cf. (6)). This is
again a significant improvement over the earlier works [8],
[13], [14], which suggest that performance of thresholding-based
approaches is inversely proportional to the dynamic
range, $\max_{i \in \mathcal{K}} \|\beta_i^0\|_2 / \min_{i \in \mathcal{K}} \|\beta_i^0\|_2$, of the nonzero groups of regression
coefficients. In order to expand on this, we observe from
(6) that

$$\|\beta_i^0\|_2 = \Omega\big(\mu_X^g \|\beta^0\|_2 \sqrt{\log m}\big) \;\Longleftrightarrow\; \frac{\|\beta_i^0\|_2^2}{\|\beta^0\|_2^2 / k} = \Omega\big(k\, (\mu_X^g)^2 \log m\big). \tag{7}$$

Theorem 1 and the ratio on the left-hand side of the second relation in (7) indicate that
inclusion of the i-th group of predictors in the estimate $\widehat{\mathcal{K}}$ is in
fact related to the ratio of the energy contributed by the i-th
group of predictors to the average energy contributed per
group of nonzero predictors: $\frac{\|\beta_i^0\|_2^2}{\|\beta^0\|_2^2 / k}$. Further, this implies
that an increase in the dynamic range that comes from a
decrease in $\min_{i \in \mathcal{K}} \|\beta_i^0\|_2$ cannot affect the performance of
GroTh too much since $\frac{\|\beta_i^0\|_2^2}{\|\beta^0\|_2^2 / k}$ increases for most groups
of predictors in this case. This is indeed confirmed by the
numerical experiments reported in Section IV.
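To make the last point concrete, consider a hypothetical configuration (an assumption for illustration, loosely mirroring the normalization used in the experiments of Section IV) in which $k - 1$ of the nonzero groups have unit $\ell_2$ norm and the remaining group has norm $1/D$, so that the dynamic range is $D \ge 1$. The energy ratios then work out to:

```latex
\|\beta^0\|_2^2 = (k - 1) + D^{-2},
\qquad
\frac{\|\beta_i^0\|_2^2}{\|\beta^0\|_2^2 / k}
  = \begin{cases}
      \dfrac{k}{(k - 1) + D^{-2}}, & \text{for each of the } k - 1 \text{ unit-norm groups},\\[2ex]
      \dfrac{k\, D^{-2}}{(k - 1) + D^{-2}}, & \text{for the single weak group}.
    \end{cases}
```

As D grows, the first ratio increases monotonically from 1 toward $k/(k-1)$, while only the ratio of the weak group shrinks, so under this assumed configuration at most one of the k groups becomes harder to detect.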

III. Proof of the Main Result 

We begin by developing some notation to facilitate the
forthcoming analysis. Notice that the p-dimensional vector
of marginal correlations, $f = X^T y$, can be written as
m groups of r-dimensional marginal correlations: $f^T = [f_1^T \dots f_m^T]$
with the $r \times 1$ vector $f_i = X_i^T y$. In the
following, we use $X_{\mathcal{K}}$ (an $n \times rk$ submatrix of X), $\beta_{\mathcal{K}}^0$ (an
$rk \times 1$ subvector of $\beta^0$), and $f_{\mathcal{K}} := X_{\mathcal{K}}^T y = X_{\mathcal{K}}^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0$
(an $rk \times 1$ subvector of f) to denote the groups of predictors,
groups of regression coefficients, and the marginal
correlations corresponding to the true model $\mathcal{K}$, respectively.
Similarly, we use $X_{\mathcal{K}^c}$ and $f_{\mathcal{K}^c} := X_{\mathcal{K}^c}^T y = X_{\mathcal{K}^c}^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0$ to
denote the groups of predictors and the marginal correlations
corresponding to the complement set $\mathcal{K}^c := [m] \setminus \mathcal{K}$,
respectively.

A. Lemmata 

Proof of Theorem 1 requires understanding the behaviors
of the $rk \times 1$ group vector $(X_{\mathcal{K}}^T X_{\mathcal{K}} - I)\beta_{\mathcal{K}}^0$ and the
$r(m - k) \times 1$ group vector $X_{\mathcal{K}^c}^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0$. In this subsection, we state
and prove two lemmas that help us toward this goal. We
will then leverage these two lemmas to provide a proof of
Theorem 1.

Before proceeding, recall that $\mathcal{K}$ is taken to be a uniformly
random k-subset of [m], while the set of nonzero
group regression coefficients $\{z_i\}_{i=1}^{k} := \{\beta_i^0 : i \in \mathcal{K}\}$
is considered to be deterministic (and fixed) but unknown.
It therefore follows that the rk-dimensional group vector
$(X_{\mathcal{K}}^T X_{\mathcal{K}} - I)\beta_{\mathcal{K}}^0$ can be equivalently expressed as

$$(X_{\mathcal{K}}^T X_{\mathcal{K}} - I)\beta_{\mathcal{K}}^0 \overset{d}{=} (X_{\Pi}^T X_{\Pi} - I) z, \tag{8}$$

where $\bar{\Pi} := (\pi_1, \dots, \pi_m)$ is a random permutation of
[m], $\Pi := (\pi_1, \dots, \pi_k)$ denotes the first k elements of $\bar{\Pi}$,
$X_{\Pi} := [X_{\pi_1} \dots X_{\pi_k}]$ is an $n \times rk$ submatrix of X, and
$z^T := [z_1^T \dots z_k^T]$ is an $rk \times 1$ (group) vector of nonzero
regression coefficients. Similarly, the $r(m - k)$-dimensional
group vector $X_{\mathcal{K}^c}^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0$ can be expressed as

$$X_{\mathcal{K}^c}^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0 \overset{d}{=} X_{\Pi^c}^T X_{\Pi} z, \tag{9}$$

where $\Pi^c := (\pi_{k+1}, \dots, \pi_m)$ denotes the last $m - k$ elements
of $\bar{\Pi}$ and $X_{\Pi^c} := [X_{\pi_{k+1}} \dots X_{\pi_m}]$ is an $n \times r(m - k)$
submatrix of X.

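Lemma 1. Fix $c_1 > 2$ and $\epsilon \in (0, 1)$. Next, assume
$k \le \min\{\epsilon^2 (\nu_X^g)^{-2} + 1,\; c_1^{-1} m\}$ and let $\Pi = (\pi_1, \dots, \pi_k)$ denote
the first k elements of a random permutation of [m]. Then
for any fixed $rk \times 1$ group vector $z^T := [z_1^T \dots z_k^T]$,

$$\Pr\big( \|(X_{\Pi}^T X_{\Pi} - I) z\|_{2,\infty} \ge \epsilon \|z\|_2 \big) \le e^2 k \exp\big( -c_4 (\epsilon - \nu_X^g \sqrt{k - 1})^2 (\mu_X^g)^{-2} \big), \tag{10}$$

where $c_4 := \frac{(c_1 - 1)^2}{1024\, e\, (2c_1 - 1)^2}$ is an absolute constant.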



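Proof. The proof of this lemma relies heavily on the Banach-space-valued
Azuma's inequality stated in the Appendix. To
begin, note that

$$\|(X_{\Pi}^T X_{\Pi} - I) z\|_{2,\infty} = \max_{i \in [k]} \Big\| \sum_{j=1,\, j \neq i}^{k} X_{\pi_i}^T X_{\pi_j} z_j \Big\|_2. \tag{11}$$

We next fix an $i \in [k]$ and define the event $A_{i'} := \{\pi_i = i'\}$
for $i' \in [m]$. Then conditioned on $A_{i'}$, we have

$$\Pr\Big( \Big\| \sum_{j=1,\, j \neq i}^{k} X_{\pi_i}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \,\Big|\, A_{i'} \Big) = \Pr\Big( \Big\| \sum_{j=1,\, j \neq i}^{k} X_{i'}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \,\Big|\, A_{i'} \Big). \tag{12}$$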

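In order to make use of the concentration inequality in
Proposition 1 in the Appendix for upper bounding (12), we
construct an $\mathbb{R}^r$-valued Doob martingale on $\sum_{j \neq i} X_{i'}^T X_{\pi_j} z_j$.
We first define $\Pi^{-i} := (\pi_1, \dots, \pi_{i-1}, \pi_{i+1}, \dots, \pi_k)$ and then
define the Doob martingale $(M_0, M_1, \dots, M_{k-1})$ as follows:

$$M_0 := \sum_{j=1,\, j \neq i}^{k} X_{i'}^T\, \mathbb{E}\big[X_{\pi_j} \,\big|\, A_{i'}\big]\, z_j, \quad \text{and}$$

$$M_\ell := \sum_{j=1,\, j \neq i}^{k} X_{i'}^T\, \mathbb{E}\big[X_{\pi_j} \,\big|\, \pi^{-i}_{1 \to \ell}, A_{i'}\big]\, z_j, \quad \ell = 1, \dots, k - 1,$$

where $\pi^{-i}_{1 \to \ell}$ denotes the first $\ell$ elements of $\Pi^{-i}$. The next
step involves showing that the constructed martingale has
bounded $\ell_2$ differences. To this end, we use $\pi^{-i}_\ell$ to
denote the $\ell$-th element of $\Pi^{-i}$ and define

$$M_\ell(u) := \sum_{j=1,\, j \neq i}^{k} X_{i'}^T\, \mathbb{E}\big[X_{\pi_j} \,\big|\, \pi^{-i}_{1 \to \ell - 1}, \pi^{-i}_\ell = u, A_{i'}\big]\, z_j \tag{13}$$

for $u \in [m]$ and $\ell = 1, \dots, k - 1$. It can then be established,
using techniques very similar to the ones used in the method
of bounded differences for scalar-valued martingales [21], [22], that

$$\|M_\ell - M_{\ell - 1}\|_2 \le \sup_{u, v} \|M_\ell(u) - M_\ell(v)\|_2. \tag{14}$$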

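In order to upper bound $\|M_\ell(u) - M_\ell(v)\|_2$, we first define
an $n \times r$ random matrix

$$\widetilde{X}^{u,v}_{\ell,j} := \mathbb{E}\big[X_{\pi_j} \,\big|\, \pi^{-i}_{1 \to \ell - 1}, \pi^{-i}_\ell = u, A_{i'}\big] - \mathbb{E}\big[X_{\pi_j} \,\big|\, \pi^{-i}_{1 \to \ell - 1}, \pi^{-i}_\ell = v, A_{i'}\big]. \tag{15}$$

Next, we notice that for every $j > \ell + 1$, $j \neq i$, the
random variable $\pi_j$ conditioned on $\{\pi^{-i}_{1 \to \ell - 1}, \pi^{-i}_\ell = u, A_{i'}\}$
has a uniform distribution over $[m] \setminus \{\pi^{-i}_{1 \to \ell - 1}, u, i'\}$, while
$\pi_j$ conditioned on $\{\pi^{-i}_{1 \to \ell - 1}, \pi^{-i}_\ell = v, A_{i'}\}$ has a uniform
distribution over $[m] \setminus \{\pi^{-i}_{1 \to \ell - 1}, v, i'\}$. Therefore, we get

$$\widetilde{X}^{u,v}_{\ell,j} = -\frac{1}{m - \ell - 1} (X_u - X_v), \quad j > \ell + 1,\; j \neq i. \tag{16}$$

In order to evaluate $\widetilde{X}^{u,v}_{\ell,j}$ for $j \le \ell + 1$, $j \neq i$, we consider
three cases for the index i. In the first case of $i \le \ell$, it can be
seen that $\widetilde{X}^{u,v}_{\ell,j} = 0$ for every $j \le \ell$ and $\widetilde{X}^{u,v}_{\ell,j} = X_u - X_v$ for
$j = \ell + 1$. In the second case of $i = \ell + 1$, it can similarly
be seen that $\widetilde{X}^{u,v}_{\ell,j} = 0$ for every $j < \ell$ and $j = \ell + 1$,
while $\widetilde{X}^{u,v}_{\ell,j} = X_u - X_v$ for $j = \ell$. In the final case of
$i > \ell + 1$, it can be argued that $\widetilde{X}^{u,v}_{\ell,j} = 0$ for every $j < \ell$,
$\widetilde{X}^{u,v}_{\ell,j} = X_u - X_v$ for $j = \ell$, and $\widetilde{X}^{u,v}_{\ell,j} = -\frac{1}{m - \ell - 1}(X_u - X_v)$
for $j = \ell + 1$. Consequently, regardless of the initial choice
of i, we have

$$\|M_\ell(u) - M_\ell(v)\|_2 = \Big\| \sum_{j \neq i} X_{i'}^T \widetilde{X}^{u,v}_{\ell,j} z_j \Big\|_2 \overset{(a)}{\le} \sum_{j \neq i} \|X_{i'}^T \widetilde{X}^{u,v}_{\ell,j}\|_2 \|z_j\|_2 \overset{(b)}{\le} 2\mu_X^g \Big( \|z_\ell\|_2 + \|z_{\ell+1}\|_2 + \frac{\sum_{j > \ell + 1} \|z_j\|_2}{m - \ell - 1} \Big), \tag{17}$$

where (a) is due to the triangle inequality and the submultiplicative
nature of the induced norm, while (b) primarily
follows since $\|X_{i'}^T X_u - X_{i'}^T X_v\|_2 \le 2\mu_X^g$. We now have
from (14) and (17) that $\|M_\ell - M_{\ell - 1}\|_2 \le a_\ell$ with

$$a_\ell := 2\mu_X^g \Big( \|z_\ell\|_2 + \|z_{\ell+1}\|_2 + \frac{\sum_{j > \ell + 1} \|z_j\|_2}{m - \ell - 1} \Big). \tag{18}$$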



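The next step needed to upper bound (12) involves providing
an upper bound on $\|M_0\|_2$. To this end, note that

$$\begin{aligned}
\|M_0\|_2 &= \Big\| \sum_{j \neq i} X_{i'}^T\, \mathbb{E}\big[X_{\pi_j} \,\big|\, A_{i'}\big]\, z_j \Big\|_2 \overset{(c)}{=} \Big\| \sum_{j \neq i} \frac{1}{m - 1} \sum_{q \neq i'} X_{i'}^T X_q z_j \Big\|_2 \\
&\le \sum_{j \neq i} \frac{1}{m - 1} \Big\| \sum_{q \neq i'} X_{i'}^T X_q \Big\|_2 \|z_j\|_2 \overset{(d)}{\le} \nu_X^g \sum_{j \neq i} \|z_j\|_2 \le \nu_X^g \sqrt{k - 1}\, \|z\|_2,
\end{aligned} \tag{19}$$

where (c) follows since $\pi_j$ conditioned on $A_{i'}$ has a uniform
distribution over $[m] \setminus \{i'\}$ and (d) is a consequence of the
definition of average group coherence. Finally, we note from
[23, Lemma B.1] that $\rho_B(\tau)$ defined in Proposition 1 satisfies
$\rho_B(\tau) \le \tau^2 / 2$ for $(B, \|\cdot\|) = (L_2(\mathbb{R}^r), \|\cdot\|_2)$. Consequently,
under the assumption that $k \le \epsilon^2 (\nu_X^g)^{-2} + 1$, it can be seen
from our construction of the Doob martingale that

$$\begin{aligned}
\Pr\Big( \Big\| \sum_{j \neq i} X_{i'}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \,\Big|\, A_{i'} \Big) &\le \Pr\Big( \|M_{k-1} - M_0\|_2 \ge (\epsilon - \nu_X^g \sqrt{k - 1}) \|z\|_2 \,\Big|\, A_{i'} \Big) \\
&\overset{(e)}{\le} e^2 \exp\Big( -\frac{c_0 (\epsilon - \nu_X^g \sqrt{k - 1})^2 \|z\|_2^2}{\sum_{\ell=1}^{k-1} a_\ell^2} \Big),
\end{aligned} \tag{20}$$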



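where (e) follows from the Banach-space-valued Azuma's inequality
stated in the Appendix. Further, it can be established
using (18) through tedious algebraic manipulations that

$$\sum_{\ell=1}^{k-1} a_\ell^2 \le \Big( 16 + \frac{4k^2}{(m - k)^2} + \frac{16k}{m - k} \Big) (\mu_X^g)^2 \|z\|_2^2 \overset{(f)}{\le} 4\big(2 + (c_1 - 1)^{-1}\big)^2 (\mu_X^g)^2 \|z\|_2^2, \tag{21}$$

where (f) follows from the condition $k \le m / c_1$. Combining
all these facts together, we finally obtain from (20) and (21)
the following concentration inequality:

$$\begin{aligned}
\Pr\big( \|(X_{\Pi}^T X_{\Pi} - I) z\|_{2,\infty} \ge \epsilon \|z\|_2 \big) &\overset{(g)}{\le} k \Pr\Big( \Big\| \sum_{j \neq i} X_{\pi_i}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \Big) \\
&= k \sum_{i'=1}^{m} \Pr\Big( \Big\| \sum_{j \neq i} X_{i'}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \,\Big|\, A_{i'} \Big) \Pr(A_{i'}) \\
&\overset{(h)}{\le} e^2 k \exp\big( -c_4 (\epsilon - \nu_X^g \sqrt{k - 1})^2 (\mu_X^g)^{-2} \big),
\end{aligned} \tag{22}$$

where $c_4 := c_0 / \big(4 (2 + (c_1 - 1)^{-1})^2\big)$, (g) follows from the
union bound and the fact that the $\pi_i$'s are identically distributed,
while (h) follows since $\pi_i$ has a uniform distribution over
the set [m]. ■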



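Lemma 2. Fix $c_1 > 2$ and $\epsilon \in (0, 1)$. Next, assume
$k \le \min\{\epsilon^2 (\nu_X^g)^{-2},\; c_1^{-1} m\}$, and let $\Pi = (\pi_1, \dots, \pi_k)$ and $\Pi^c =
(\pi_{k+1}, \dots, \pi_m)$ denote the first k elements and the last $(m - k)$
elements of a random permutation of [m], respectively.
Then for any fixed $rk \times 1$ group vector z,

$$\Pr\big( \|X_{\Pi^c}^T X_{\Pi} z\|_{2,\infty} \ge \epsilon \|z\|_2 \big) \le e^2 (m - k) \exp\big( -c_5 (\epsilon - \sqrt{k}\, \nu_X^g)^2 (\mu_X^g)^{-2} \big), \tag{23}$$

where $c_5 := \frac{(c_1 - 1)^2}{1024\, e\, c_1^2}$ is an absolute constant.

Proof. The proof of this lemma is similar to that of Lemma 1
and also relies on Proposition 1 in the Appendix. To begin,
we use $\pi_i^c$ to denote the i-th element of $\Pi^c$ and note

$$\|X_{\Pi^c}^T X_{\Pi} z\|_{2,\infty} = \max_{i \in [m - k]} \Big\| \sum_{j=1}^{k} X_{\pi_i^c}^T X_{\pi_j} z_j \Big\|_2. \tag{24}$$

We next fix an $i \in [m - k]$ and define $A'_{i'} := \{\pi_i^c = i'\}$ for
$i' \in [m]$. Then conditioned on $A'_{i'}$, we again have the
following simple equality:

$$\Pr\Big( \Big\| \sum_{j=1}^{k} X_{\pi_i^c}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \,\Big|\, A'_{i'} \Big) = \Pr\Big( \Big\| \sum_{j=1}^{k} X_{i'}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \,\Big|\, A'_{i'} \Big). \tag{25}$$

In order to upper bound (25) using Proposition 1, we now
construct an $\mathbb{R}^r$-valued Doob martingale $(M_0, M_1, \dots, M_k)$
on $\sum_{j=1}^{k} X_{i'}^T X_{\pi_j} z_j$ as follows:

$$M_0 := \sum_{j=1}^{k} X_{i'}^T\, \mathbb{E}\big[X_{\pi_j} \,\big|\, A'_{i'}\big]\, z_j, \quad \text{and}$$

$$M_\ell := \sum_{j=1}^{k} X_{i'}^T\, \mathbb{E}\big[X_{\pi_j} \,\big|\, \pi_{1 \to \ell}, A'_{i'}\big]\, z_j, \quad \ell = 1, \dots, k,$$

where $\pi_{1 \to \ell}$ denotes the first $\ell$ elements of $\Pi$. The next step
in the proof involves showing $\|M_\ell - M_{\ell - 1}\|_2$ is bounded for
all $\ell \in [k]$. To do this, we define

$$M_\ell(u) := \sum_{j=1}^{k} X_{i'}^T\, \mathbb{E}\big[X_{\pi_j} \,\big|\, \pi_{1 \to \ell - 1}, \pi_\ell = u, A'_{i'}\big]\, z_j \tag{26}$$

for $u \in [m]$ and once again resort to the argument in
Lemma 1 that $\|M_\ell - M_{\ell - 1}\|_2 \le \sup_{u, v} \|M_\ell(u) - M_\ell(v)\|_2$.
Further, we define an $n \times r$ random matrix

$$\widetilde{X}^{u,v}_{\ell,j} := \mathbb{E}\big[X_{\pi_j} \,\big|\, \pi_{1 \to \ell - 1}, \pi_\ell = u, A'_{i'}\big] - \mathbb{E}\big[X_{\pi_j} \,\big|\, \pi_{1 \to \ell - 1}, \pi_\ell = v, A'_{i'}\big] \tag{27}$$

and notice that $\widetilde{X}^{u,v}_{\ell,j} = 0$ for $j < \ell$, $\widetilde{X}^{u,v}_{\ell,j} = X_u - X_v$ for
$j = \ell$, and $\widetilde{X}^{u,v}_{\ell,j} = -\frac{1}{m - \ell - 1}(X_u - X_v)$ for $j > \ell$. It then
follows from this discussion that

$$\|M_\ell(u) - M_\ell(v)\|_2 \le \sum_{j=1}^{k} \|X_{i'}^T \widetilde{X}^{u,v}_{\ell,j}\|_2 \|z_j\|_2 \overset{(a)}{\le} 2\mu_X^g \Big( \|z_\ell\|_2 + \frac{\sum_{j > \ell} \|z_j\|_2}{m - \ell - 1} \Big), \tag{28}$$

where (a) is primarily due to $\|X_{i'}^T X_u - X_{i'}^T X_v\|_2 \le 2\mu_X^g$.
We have now established that $\|M_\ell - M_{\ell - 1}\|_2 \le a_\ell$ with

$$a_\ell := 2\mu_X^g \Big( \|z_\ell\|_2 + \frac{\sum_{j > \ell} \|z_j\|_2}{m - \ell - 1} \Big). \tag{29}$$

The final bound we need in order to utilize Proposition 1 is
that on $\|M_0\|_2$. Similar to (19) in Lemma 1, however, it is
straightforward to show that $\|M_0\|_2 \le \nu_X^g \sqrt{k}\, \|z\|_2$.

It now follows from our construction of the Doob martingale,
Proposition 1 in the Appendix, [23, Lemma B.1] and
the assumption $k \le \epsilon^2 (\nu_X^g)^{-2}$ that

$$\begin{aligned}
\Pr\Big( \Big\| \sum_{j=1}^{k} X_{i'}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \,\Big|\, A'_{i'} \Big) &\le \Pr\Big( \|M_k - M_0\|_2 \ge (\epsilon - \nu_X^g \sqrt{k}) \|z\|_2 \,\Big|\, A'_{i'} \Big) \\
&\le e^2 \exp\Big( -\frac{c_0 (\epsilon - \nu_X^g \sqrt{k})^2 \|z\|_2^2}{\sum_{\ell=1}^{k} a_\ell^2} \Big).
\end{aligned} \tag{30}$$

In addition, it can be shown using (29) and the assumption
$k \le m / c_1$ that $\sum_{\ell=1}^{k} a_\ell^2 \le 4\big(1 + (c_1 - 1)^{-1}\big)^2 (\mu_X^g)^2 \|z\|_2^2$.
Combining all these facts together, we obtain the claimed
result as follows:

$$\begin{aligned}
\Pr\big( \|X_{\Pi^c}^T X_{\Pi} z\|_{2,\infty} \ge \epsilon \|z\|_2 \big) &\overset{(b)}{\le} (m - k) \Pr\Big( \Big\| \sum_{j=1}^{k} X_{\pi_i^c}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \Big) \\
&= (m - k) \sum_{i'=1}^{m} \Pr\Big( \Big\| \sum_{j=1}^{k} X_{i'}^T X_{\pi_j} z_j \Big\|_2 \ge \epsilon \|z\|_2 \,\Big|\, A'_{i'} \Big) \Pr(A'_{i'}) \\
&\overset{(c)}{\le} e^2 (m - k) \exp\big( -c_5 (\epsilon - \nu_X^g \sqrt{k})^2 (\mu_X^g)^{-2} \big),
\end{aligned} \tag{31}$$

where $c_5 := c_0 / \big(4 (1 + (c_1 - 1)^{-1})^2\big)$, (b) follows from the
union bound and the fact that the $\pi_i^c$'s are identically distributed,
while (c) follows since $\pi_i^c$ has a uniform distribution over
the set [m]. ■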



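B. Proof of Theorem 1

Define $\widetilde{\mathcal{K}} := \big\{ i \in \mathcal{K} : \|\beta_i^0\|_2 > c_3\, \mu_X^g \|\beta^0\|_2 \sqrt{\log m} \big\}$.
In order to prove this theorem, we need to understand the
behavior of the marginal correlations corresponding to the
restricted model $\widetilde{\mathcal{K}}$, and the marginal correlations corresponding
to the complement set $\mathcal{K}^c$. To this end, recall the definition
of L from the statement of the theorem and note that

$$\begin{aligned}
\min_{i \in \widetilde{\mathcal{K}}} \|f_i\|_2 &= \min_{i \in \widetilde{\mathcal{K}}} \big\| \beta_i^0 + (X_i^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0 - \beta_i^0) \big\|_2 \\
&\ge \min_{i \in \widetilde{\mathcal{K}}} \|\beta_i^0\|_2 - \max_{i \in \mathcal{K}} \big\| X_i^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0 - \beta_i^0 \big\|_2 \\
&= \|\beta_{(L)}^0\|_2 - \big\| (X_{\mathcal{K}}^T X_{\mathcal{K}} - I) \beta_{\mathcal{K}}^0 \big\|_{2,\infty}.
\end{aligned} \tag{32}$$

In addition, we trivially have

$$\max_{i \in \mathcal{K}^c} \|f_i\|_2 = \max_{i \in \mathcal{K}^c} \|X_i^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0\|_2 = \|X_{\mathcal{K}^c}^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0\|_{2,\infty}. \tag{33}$$

(a) Plots of $\mu_X^g \sqrt{\log m}$ (b) Plots of $\nu_X^g$ (solid) and $\mu_X^g \sqrt{r \log m / n}$ (dashed) (c) Plots of $1 - \mathrm{NDP}$ for GroTh

Fig. 1. Numerical experiments validating the main result of the paper. Together, (a) and (b) illustrate that the set of design matrices satisfying the group
coherence property is not empty. Further, (c) illustrates that the performance of GroTh is not exactly a function of the dynamic range.

Fig. 2. Comparison between the performances of GroTh and thresholding
of individual marginal correlations that ignores the grouping of predictors.

It is easy to argue using (32) and (33) that

$$\|\beta_{(L)}^0\|_2 > \big\| (X_{\mathcal{K}}^T X_{\mathcal{K}} - I) \beta_{\mathcal{K}}^0 \big\|_{2,\infty} + \|X_{\mathcal{K}^c}^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0\|_{2,\infty} \tag{34}$$

is a sufficient condition for the proof of the theorem. To see
this, note that (34) implies $\min_{i \in \widetilde{\mathcal{K}}} \|f_i\|_2 > \max_{i \in \mathcal{K}^c} \|f_i\|_2$.
This in turn means $\widetilde{\mathcal{K}} \subset \widehat{\mathcal{K}}$, since $L \le k$, resulting in
$\mathrm{FDP}(\widehat{\mathcal{K}}) \le 1 - L/k$ and $\mathrm{NDP}(\widehat{\mathcal{K}}) \le 1 - L/k$.

The next step in the proof is therefore establishing that
the sufficient condition (34) holds in our case. It is easy to
show using Lemmas 1-2 and the union bound that

$$\big\| (X_{\mathcal{K}}^T X_{\mathcal{K}} - I) \beta_{\mathcal{K}}^0 \big\|_{2,\infty} + \|X_{\mathcal{K}^c}^T X_{\mathcal{K}} \beta_{\mathcal{K}}^0\|_{2,\infty} \ge \epsilon \|\beta^0\|_2 \tag{35}$$

with probability $\delta \le e^2 m \exp\big( -c_4 (\epsilon - \sqrt{k}\, \nu_X^g)^2 (\mu_X^g)^{-2} \big)$
as long as $k \le \min\{\epsilon^2 (\nu_X^g)^{-2},\; c_1^{-1} m\}$ for $c_1 > 2$ and $\epsilon \in
(0, 1)$. We now fix $\epsilon = c_3\, \mu_X^g \sqrt{\log m}$ and claim that (35)
holds with probability $\delta \le e^2 m^{-1}$ under the assumptions
of the theorem. Notice that validity of this claim implies
the sufficient condition (34) holds with probability $1 - \delta \ge
1 - e^2 m^{-1}$ as long as $\|\beta_{(L)}^0\|_2 > c_3\, \mu_X^g \|\beta^0\|_2 \sqrt{\log m}$.

In order to complete the proof, we therefore need only
establish the claim that (35) holds for $\epsilon = c_3\, \mu_X^g \sqrt{\log m}$
with probability $\delta \le e^2 m^{-1}$. In this regard, note: (i) $\epsilon < 1$
because of (GroCP-1) with $c_\mu < c_3^{-1}$ and (ii) $\sqrt{k}\, \nu_X^g \le c_2 \epsilon$
because of $c_1 r k \le n$ and (GroCP-2) with $c_\nu \le \sqrt{c_1}\, c_2 c_3$.
It then follows that (35) holds for $\epsilon = c_3\, \mu_X^g \sqrt{\log m}$ with
probability $\delta \le e^2 m^{1 - c_4 (1 - c_2)^2 c_3^2}$. The proof now trivially
follows by noting that $c_4 (1 - c_2)^2 c_3^2 = 2$ for the chosen
value of $c_3$.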
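The identity invoked in the last step can be checked numerically. The sketch below assumes the closed forms $c_3 = 32\sqrt{2e}\,(2c_1 - 1) / ((1 - c_2)(c_1 - 1))$ and $c_4 = (c_1 - 1)^2 / (1024\, e\, (2c_1 - 1)^2)$ as read from Theorem 1 and Lemma 1; the function name `theorem_constants` is hypothetical.

```python
import math

def theorem_constants(c1, c2):
    """Sketch: evaluate c_3 (Theorem 1) and c_4 (Lemma 1) for c1 > 2, 0 < c2 < 1.

    The closed forms used here are the ones assumed in the lead-in above.
    """
    if not (c1 > 2 and 0 < c2 < 1):
        raise ValueError("need c1 > 2 and c2 in (0, 1)")
    c3 = 32 * math.sqrt(2 * math.e) * (2 * c1 - 1) / ((1 - c2) * (c1 - 1))
    c4 = (c1 - 1) ** 2 / (1024 * math.e * (2 * c1 - 1) ** 2)
    return c3, c4

c3, c4 = theorem_constants(c1=3.0, c2=0.5)
# The exponent identity used in the proof: c4 * (1 - c2)^2 * c3^2 should equal 2.
print(c4 * (1 - 0.5) ** 2 * c3 ** 2)
```

Evaluating `theorem_constants` for a few admissible pairs also gives a feel for how conservative the resulting threshold constant is.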



IV. Numerical Results 

In this section, we report the outcomes of some numerical
experiments that validate Theorem 1. The $n \times p$ matrix X
in all these experiments is created as follows. First, we
generate m matrices $X_i$ of size $n \times r$ whose entries are drawn
independently from a standard normal distribution. Next, we
use the Gram-Schmidt process to orthonormalize the $X_i$'s and
stack the resulting orthonormal $X_i$'s into an $n \times p$ design
matrix X.
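A design matrix of this kind can be generated along the following lines. Note that this sketch substitutes a reduced QR factorization for an explicit Gram-Schmidt loop, which yields an orthonormal basis of the same column space for each block; the function name `make_design_matrix` and the seed handling are illustrative assumptions.

```python
import numpy as np

def make_design_matrix(n, r, m, seed=None):
    """Sketch: stack m random n x r blocks, each with orthonormal columns.

    QR is used here in place of Gram-Schmidt (an assumption of this sketch);
    n >= r is required so that each block can have orthonormal columns.
    """
    if n < r:
        raise ValueError("need n >= r")
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(m):
        gaussian = rng.standard_normal((n, r))   # i.i.d. N(0, 1) entries
        q, _ = np.linalg.qr(gaussian)            # reduced QR: q is n x r
        blocks.append(q)
    return np.hstack(blocks)                     # n x (r * m) design matrix

X = make_design_matrix(n=20, r=3, m=5, seed=0)
print(X.shape)
```

Each block then satisfies the orthonormality assumption of Section II, and every column of the stacked matrix has unit norm.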

The first set of experiments reported in Fig. 1(a) and
Fig. 1(b) confirms that the set of design matrices satisfying
the group coherence property is not empty. Specifically,
Fig. 1(a) plots $\mu_X^g \sqrt{\log m}$ as a function of m for p = 20000
and four different values of n. It can be seen from this figure
that $\mu_X^g \sqrt{\log m} = O(1)$, which verifies (GroCP-1). Further,
Fig. 1(b) plots both $\nu_X^g$ (solid lines) and $\mu_X^g \sqrt{r \log m / n}$
(dashed lines) as a function of n for p = 20000 and four
different values of m. It can be seen from this figure that
$\nu_X^g = O(\mu_X^g \sqrt{r \log m / n})$, which verifies (GroCP-2).



The second set of experiments reported in Fig. 1(c) confirms
that the performance of GroTh is not exactly a function
of the dynamic range. In these experiments, corresponding
to p = 15000, n = 3000 and r = 12, all but one group
of nonzero regression coefficients $\{\beta_i^0\}_{i \in \mathcal{K}}$ are normalized
to have unit $\ell_2$ norms, while one randomly selected group
of nonzero regression coefficients is normalized to yield the
specified dynamic range. Fig. 1(c) plots $1 - \mathrm{NDP}$ (averaged
over 500 random realizations of the true model $\mathcal{K}$) for GroTh
under this setup as a function of rk for four different values
of dynamic range. It can be seen from this figure that the
performance of GroTh indeed does not change with the
dynamic range, because of the reasons outlined earlier in
Section II.
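A heavily scaled-down version of this experiment can be sketched as follows. The problem sizes are far smaller than those reported above, the weak group is assumed to be scaled down by the dynamic range, and QR is used for orthonormalization; the function name `simulate_one_minus_ndp` and all default values are assumptions made here, so the output is not meant to reproduce Fig. 1(c).

```python
import numpy as np

def simulate_one_minus_ndp(n=200, r=4, m=100, k=5, dynamic_range=10.0,
                           trials=20, seed=0):
    """Sketch: average 1 - NDP of group thresholding over random true models."""
    rng = np.random.default_rng(seed)
    # Design matrix with orthonormal groups (QR in place of Gram-Schmidt).
    X = np.hstack([np.linalg.qr(rng.standard_normal((n, r)))[0]
                   for _ in range(m)])
    ndps = []
    for _ in range(trials):
        support = rng.choice(m, size=k, replace=False)   # random k-subset of groups
        beta = np.zeros(m * r)
        for idx, g in enumerate(support):
            b = rng.standard_normal(r)
            b /= np.linalg.norm(b)                       # unit l_2 norm
            if idx == 0:
                b /= dynamic_range                       # one weak group
            beta[g * r:(g + 1) * r] = b
        y = X @ beta                                     # noiseless response
        norms = np.linalg.norm((X.T @ y).reshape(m, r), axis=1)
        estimate = set(np.argsort(-norms)[:k].tolist())  # top-k groups
        ndps.append(len(set(support.tolist()) - estimate) / k)
    return 1.0 - float(np.mean(ndps))

for dr in (1.0, 10.0):
    print(dr, simulate_one_minus_ndp(dynamic_range=dr))
```

Sweeping `k` and `dynamic_range` in this sketch gives a qualitative, small-scale counterpart of the experiment described above.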

The final set of experiments reported in Fig. 2 illustrates
that GroTh performs better than thresholding of the
individual marginal correlations that ignores the grouping
of predictors. In these experiments, corresponding to p =
15000 and n = 3000, all groups of nonzero regression
coefficients $\{\beta_i^0\}_{i \in \mathcal{K}}$ have unit $\ell_2$ norms, but individual
nonzero regression coefficients do not necessarily have the same
magnitudes. Fig. 2 plots FDP and $1 - \mathrm{NDP}$ (averaged over
500 random realizations of the true model JC) for both GroTh 
and (individual) thresholding under this setup as a function 
of rk for three different values of r. It can be seen from this 
figure that thresholding of individual marginal correlations 
performs almost identically for different r. Performance of 
GroTh, on the other hand, improves with an increase in r. 

V. Conclusions 

In this paper, we have provided a comprehensive un- 
derstanding of Group Thresholding (GroTh) for high- 
dimensional group model selection. In particular, we have 
established that the performance of GroTh can be character- 
ized in terms of a global geometric property of the design 
matrix that is explicitly verifiable in polynomial time. Results 
reported in this paper have also enhanced our understanding 
of thresholding-based approaches in high-dimensional lin- 
ear models that rely on marginal correlations between the 
predictors and the response variable. In the future, we plan 
on extending this work by deriving fundamental bounds on 
worst-case and average group coherences, providing explicit 
examples of design matrices that satisfy the group coherence 
property, understanding the effects of modeling error, and 
relaxing the assumption of orthonormal groups of predictors. 

Appendix 

Banach-Space-Valued Azuma's Inequality

In this appendix, we state a Banach-space-valued concentration
inequality from [24] that is central to this paper.

Proposition 1 (Banach-Space-Valued Azuma's Inequality).
Fix s > 0 and assume that a Banach space $(B, \|\cdot\|)$ satisfies

$$\rho_B(\tau) := \sup_{u, v \in B :\, \|u\| = \|v\| = 1} \Big\{ \frac{\|u + \tau v\| + \|u - \tau v\|}{2} - 1 \Big\} \le s \tau^2$$

for all $\tau > 0$. Let $\{M_k\}_{k=0}^{\infty}$ be a B-valued martingale
satisfying the pointwise bound $\|M_k - M_{k-1}\| \le a_k$ for all
$k \in \mathbb{N}$, where $\{a_k\}_{k=1}^{\infty}$ is a sequence of positive numbers.
Then for every $\delta > 0$ and $k \in \mathbb{N}$, we have

$$\Pr\big( \|M_k - M_0\| \ge \delta \big) \le e^{\max\{s, 2\}} \exp\Big( -\frac{c_0\, \delta^2}{\sum_{\ell=1}^{k} a_\ell^2} \Big),$$

where $c_0 := \frac{e^{-1}}{256}$ is an absolute constant.

Remark 1. Theorem 1.5 in [24] does not explicitly specify
$c_0$ and also states the constant in front of $\exp(\cdot)$ to be $e^{s+2}$.
Proposition 1 stated in its current form, however, can be
obtained from the proof of Theorem 1.5 in [24].